library(robustbase)
library(psych)
library(MASS)
library(ellipse)
library(here)
library(DescTools)
library(knitr)
library(RobStatTM)
Following the definitions outlined in the very first class of the module, explicit robust estimators are readily available in R. There is not much to say about them here; let us simply recreate the table presented in slide 18 of the Introduction and explicit robust estimators section.
x <-
c(9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21)
x_9 <- x[-10]
est_names <- c(
"$\\bar{x}$ ",
"$\\bar{x}_{0.1}$ ",
"$\\tilde{x}_{0.1}$",
"$Me(x)$",
"$\\hat{\\sigma}$",
"$MD(x)$",
"$MAD(x)$",
"$MADN(x)$",
"$IQR(x)$"
)
explicit_est_x_9 <- c(mean(x_9),
mean(x_9, trim = .1),
psych::winsor.mean(x = x_9, trim = .1),
median(x_9),
sd(x_9),
DescTools::MeanAD(x_9),
mad(x_9,constant = 1),
mad(x_9), # MADN
IQR(x_9))
explicit_est_x <- c(mean(x),
mean(x, trim = .1),
psych::winsor.mean(x = x, trim = .1),
median(x),
sd(x),
DescTools::MeanAD(x),
mad(x,constant = 1),
mad(x),
IQR(x)
)
kable(
data.frame(est_names, explicit_est_x_9, explicit_est_x),
digits = c(0, 3, 3),
escape = FALSE,
col.names = c("Estimator", "Without outlier ($n=9$)", "With outlier ($n=10$)")
)
| Estimator | Without outlier (\(n=9\)) | With outlier (\(n=10\)) |
|---|---|---|
| \(\bar{x}\) | 9.966 | 10.490 |
| \(\bar{x}_{0.1}\) | 9.966 | 10.021 |
| \(\tilde{x}_{0.1}\) | 9.952 | 10.078 |
| \(Me(x)\) | 9.960 | 9.975 |
| \(\hat{\sigma}\) | 0.272 | 1.678 |
| \(MD(x)\) | 0.186 | 0.944 |
| \(MAD(x)\) | 0.120 | 0.145 |
| \(MADN(x)\) | 0.178 | 0.215 |
| \(IQR(x)\) | 0.170 | 0.228 |
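As a sanity check, the trimmed mean and the (normalized) MAD from the table can be reproduced by hand in base R. The snippet below is a minimal sketch, hard-coding the same tail trimming used by mean(trim=) and the default consistency constant 1.4826 used by mad:

```r
x <- c(9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21)
n <- length(x)

# 10% trimmed mean: drop floor(0.1 * n) observations from each tail
g <- floor(0.1 * n)
xs <- sort(x)
trimmed <- mean(xs[(g + 1):(n - g)])
all.equal(trimmed, mean(x, trim = 0.1))  # TRUE

# raw MAD, and MADN = MAD * 1.4826 (mad()'s default constant)
raw_mad <- median(abs(x - median(x)))
all.equal(raw_mad * 1.4826, mad(x))      # TRUE
```

The constant 1.4826 is approximately \(1/\Phi^{-1}(0.75)\), which makes the MADN consistent for \(\sigma\) at the normal model.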
Univariate M-estimators can efficiently be computed by means of the
robustbase R package. It offers plenty of features (we will
see some more further on in the lab) to customize your M-estimator,
even letting you define your own psiFunc for constructing
psi_func objects. We will not dwell on the details,
but if you are curious you can look at the vignette
available online.
Let us stick with the two we have encountered in class, namely Huber and
Tukey’s. They can easily be plotted by means of:
source(system.file("xtraR/plot-psiFun.R", package = "robustbase", mustWork=TRUE))
p.psiFun(seq(-3,3,length.out=100), "huber", par = 1.5, leg.loc="bottomright", main="T")
p.psiFun(seq(-5,5,length.out=100), "biweight", par = 1.5, leg.loc="bottomright",main="T")
We have already seen how to hard-code the iterative algorithm described in slide 13, but with an additional (robust) scale estimate \(\hat{\sigma}\).
manual_M_location <-
  function(x,
           k,
           type = c("Huber", "Tukey"),
           tol = sqrt(.Machine$double.eps),
           itermax = 1000) {
    type <- match.arg(type)    # defaults to "Huber" if type is unspecified
    mu <- mu_old <- median(x)  # initial value
    rob_scale <- mad(x)        # robust scale, kept fixed across iterations
    crit <- TRUE
    iter <- 0
    weight_f <- switch(type,
      "Huber" = function(x) pmin(1, k / abs(x)),
      "Tukey" = function(x) ((1 - (x / k)^2)^2) * (abs(x) <= k)
    )
    while (crit) {
      w_i <- weight_f((x - mu) / rob_scale)
      mu <- weighted.mean(x = x, w = w_i)
      err <- abs(mu - mu_old)
      mu_old <- mu
      iter <- iter + 1
      crit <- (err > tol & iter < itermax)
    }
    list(mu = mu, s = rob_scale, it = iter)
  }
The robustbase package contains the huberM
function for performing M-estimation of location with MAD scale by means
of the Huber \(\psi\)-function:
huberM(x = x,k = 1.5)
## $mu
## [1] 10.00333
##
## $s
## [1] 0.214977
##
## $it
## [1] 13
##
## $SE
## [1] NA
manual_M_location(x = x,k = 1.5,type = "Huber")
## $mu
## [1] 10.00333
##
## $s
## [1] 0.214977
##
## $it
## [1] 12
In order to employ other \(\psi\)-functions, as well as other
estimator types (S, MM, not covered in class), the nlrob function
offers very flexible robust fitting. Honestly, I do not find it
straightforward to use, so we move directly to robust multivariate
estimators.
We consider a data frame with average brain and body weights for \(62\) species of land mammals and three other animal types.
data("Animals2")
# let us work with the natural logarithms of the weights
Animals2$body <- log(Animals2$body)
Animals2$brain <- log(Animals2$brain)
plot(Animals2)
Does it look familiar? We have already seen a plot of these data
during the class. The aim is to derive robust location and scale
estimators employing the Minimum Covariance Determinant (MCD). There are
several ways of doing it in R, with different routines
available. We make use of the robustbase package, which
contains the “Essential” Robust Statistics tools for analyzing
data with robust methods.
The needed function is covMcd, whose main arguments
are:
- x: a matrix of data points
- alpha: numeric parameter controlling the size of the subsets over which the determinant is minimized; roughly alpha*N observations are used
- nsamp: number of subsets used for initial estimates, or a character string specifying whether “best”, “exact”, or “deterministic” estimation should be performed

Clearly, the most important hyper-parameter to be selected is
alpha, as it approximately determines the size of the
subsets used for parameter estimation, and the consequent BP of the
estimator. Let us set alpha=0.75.
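To see how alpha maps to the subset size \(h\) (and hence to the breakdown point), the rule used by Fast MCD can be hard-coded in a few lines. This is a sketch of the formula implemented by robustbase's h.alpha.n, stated here as an assumption for illustration:

```r
# Subset size h used by Fast MCD, as a function of alpha, n and p
# (mirrors the formula in robustbase::h.alpha.n)
h_alpha_n <- function(alpha, n, p) {
  n2 <- (n + p + 1) %/% 2
  floor(2 * n2 - n + 2 * (n - n2) * alpha)
}
h_alpha_n(0.75, 65, 2)                  # 49, for the n = 65 animals
# approximate breakdown point of the raw MCD estimator: (n - h + 1) / n
(65 - h_alpha_n(0.75, 65, 2) + 1) / 65  # about 0.26
```

With alpha = 0.5 the breakdown point is maximal (close to 50%), at the price of lower efficiency; alpha = 0.75 trades some robustness for efficiency.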
fit_MCD <- covMcd(x = Animals2, alpha = .75, nsamp = "best")
fit_MCD
## Minimum Covariance Determinant (MCD) estimator approximation.
## Method: Fast MCD(alpha=0.75 ==> h=49); nsamp = best; (n,k)mini = (300,5)
## Call:
## covMcd(x = Animals2, alpha = 0.75, nsamp = "best")
## Log(Det.): 0.5867
##
## Robust Estimate of Location:
## body brain
## 1.292 3.074
## Robust Estimate of Covariance:
## body brain
## body 10.999 8.164
## brain 8.164 6.529
The fit_MCD object contains the raw and reweighted
estimates of location and scatter, the resulting robust Mahalanobis
distances and the best subset used for computing the raw estimates.
ind_best_subset <- fit_MCD$best
N <- nrow(Animals2)
p <- ncol(Animals2)
plot(Animals2, col=ifelse(1:N%in%ind_best_subset,"black","red"),pch=19)
Notice that the raw MCD estimate of scatter is by default multiplied by a consistency factor
\[ c(p, \alpha)=\frac{\alpha}{F_{\chi_{p+2}^{2}}\left(q_{p, \alpha}\right)} \] (\(q_{p, \alpha}\) is the \(\alpha\) level quantile of a \(\chi^2_p\) distribution) and a finite sample correction factor, to make it consistent at the normal model and unbiased at small samples.
dplyr::near(fit_MCD$raw.center,colMeans(Animals2[ind_best_subset,]))
## body brain
## TRUE TRUE
dplyr::near(fit_MCD$raw.cov,cov(Animals2[ind_best_subset,])*prod(fit_MCD$raw.cnp2))
## body brain
## body TRUE TRUE
## brain TRUE TRUE
h <- fit_MCD$quan
dplyr::near(fit_MCD$raw.cnp2[1],(h/N)/pchisq(qchisq(p = h/N,df = p),df = p+2))
## [1] TRUE
Notice that by default, unless raw.only=TRUE is specified,
covMcd performs a reweighting step to improve the
efficiency of the final estimator. In detail, the new weights are
computed as follows:
\[ W\left(d^{2}\right)=I\left(d^{2} \leqslant \chi_{p, 0.975}^{2}\right) \] where \(d^2\) is the squared Mahalanobis distance based on the scaled raw MCD.
ind_rew_obs <-
which(
mahalanobis(
x = Animals2,
center = fit_MCD$raw.center,
cov = fit_MCD$raw.cov
) <= qchisq(p = .975, df = p)
)
dplyr::near(fit_MCD$center,colMeans(Animals2[ind_rew_obs,]))
## body brain
## TRUE TRUE
dplyr::near(fit_MCD$cov,cov(Animals2[ind_rew_obs,])*prod(fit_MCD$cnp2))
## body brain
## body TRUE TRUE
## brain TRUE TRUE
robustbase also provides a nice plotting method that
lets us explore the full set of graphical tools we have seen in class
in an almost automatic way:
plot(fit_MCD,classic=TRUE)
We see that the outliers do not influence the robust estimates: the same can only be partially said when employing the classical estimators.
Everything could have as always been hard-coded:
n <- nrow(Animals2)
sample_mean <- apply(Animals2, 2, mean)
sample_cov <- cov(Animals2)#*(n-1)/n # by default R computes the corrected sample variance
# MCD estimates
plot(Animals2, xlim=c(-10,15),ylim=c(-5,15))
lines(ellipse(x = fit_MCD$cov,centre=fit_MCD$center),lty=2, col="blue",type="l")
points(x=fit_MCD$center[1],y=fit_MCD$center[2], pch="x", col="blue", cex=2)
# ML estimates
lines(ellipse(x = sample_cov, centre=sample_mean),lty=2, col="red",type="l")
points(x=sample_mean[1],y=sample_mean[2], pch="x", col="red", cex=2)
# Robust estimates
plot(sqrt(fit_MCD$mah))
# Classical estimates
plot(sqrt(mahalanobis(x = Animals2,center = sample_mean, cov = sample_cov)))
Let us now consider a dataset containing measurements of \(p = 9\) characteristics of \(n = 677\) diaphragm parts used in the production of TV sets. Diaphragms are thin metal plates, molded by a press. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements, and to find out whether deformations or abnormalities have occurred and why.
data_philips <- readRDS(here::here("Block V - Robust statistics/data/data_philips.Rds"))
pairs(data_philips)
There is no clear visible pattern in the data, and no outliers seem to be present.
fit_MCD <- covMcd(x = data_philips, alpha = .75)
fit_MCD
## Minimum Covariance Determinant (MCD) estimator approximation.
## Method: Fast MCD(alpha=0.75 ==> h=510); nsamp = 500; (n,k)mini = (300,5)
## Call:
## covMcd(x = data_philips, alpha = 0.75)
## Log(Det.): -64.75
##
## Robust Estimate of Location:
## X1 X2 X3 X4 X5 X6 X7 X8
## -0.04092 -0.31689 -0.05079 0.43806 2.12311 0.43384 -0.10401 -0.06584
## X9
## -0.08844
## Robust Estimate of Covariance:
## X1 X2 X3 X4 X5 X6
## X1 0.0161197 0.0065964 0.0058713 0.0027880 0.0022950 1.563e-03
## X2 0.0065964 0.0113972 0.0018364 0.0007830 -0.0004515 6.220e-04
## X3 0.0058713 0.0018364 0.0068021 0.0016147 0.0022204 9.816e-04
## X4 0.0027880 0.0007830 0.0016147 0.0008569 0.0008314 3.597e-04
## X5 0.0022950 -0.0004515 0.0022204 0.0008314 0.0028981 8.686e-04
## X6 0.0015629 0.0006220 0.0009816 0.0003597 0.0008686 5.406e-04
## X7 0.0033217 0.0002781 0.0019223 0.0009564 0.0008422 2.860e-04
## X8 0.0002417 -0.0004847 0.0005638 0.0001670 0.0002286 1.742e-05
## X9 0.0022851 0.0013060 0.0014172 0.0004875 0.0007173 5.007e-04
## X7 X8 X9
## X1 0.0033217 2.417e-04 2.285e-03
## X2 0.0002781 -4.847e-04 1.306e-03
## X3 0.0019223 5.638e-04 1.417e-03
## X4 0.0009564 1.670e-04 4.875e-04
## X5 0.0008422 2.286e-04 7.173e-04
## X6 0.0002860 1.742e-05 5.007e-04
## X7 0.0013263 3.112e-04 4.385e-04
## X8 0.0003112 4.495e-04 5.354e-05
## X9 0.0004385 5.354e-05 6.273e-04
plot(fit_MCD, classic=TRUE, labels.id=FALSE, which="distance")
plot(fit_MCD,labels.id=FALSE,which=c("dd"))
Robust distances reveal a strongly deviating group of outliers, ranging from index \(491\) to index \(565\). The reason is that the process changed after the first \(100\) points, and between indices \(491\) and \(565\) it was out of control.
Let us consider the Hertzsprung-Russell Diagram Data already encountered in the very first lab and consider the problem of regressing the logarithm of light intensity as a function of the logarithm of the effective temperature at the surface of the star.
We have already seen that the OLS estimator goes bananas
plot(starsCYG)
fit_lm <- lm(log.light~log.Te, data=starsCYG)
abline(fit_lm, col="red", lwd=2)
text(starsCYG$log.Te, starsCYG$log.light, 1:nrow(starsCYG), pos=1)
Yet, if we were to look at standard diagnostic plots we would not grasp how bad the situation actually is:
plot(fit_lm)
We would, however, if we fitted the Least Median of Squares (LMS) and the Least Trimmed Squares (LTS) estimators:
fit_lms <- lmsreg(log.light~log.Te, data=starsCYG)
# This is better than the previous
fit_lts <- ltsReg(log.light~log.Te, alpha=.75,mcd=TRUE,data=starsCYG) #ltsreg in the MASS package uses an older (slower) implementation of the LTS estimator
plot(starsCYG)
abline(fit_lm, col="red", lwd=2)
abline(fit_lms, col="darkblue", lwd=2)
abline(fit_lts, col="darkgreen", lwd=2)
legend("bottomleft", c('OLS', 'LMS', 'LTS'), lwd=rep(2,3), col=c("red", "darkblue", "darkgreen"))
As for the MCD estimator, robustbase provides plotting
methods for LTS
plot(fit_lts)
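To make the LTS criterion concrete, here is a minimal base-R sketch of the objective that ltsReg minimizes, namely the sum of the \(h\) smallest squared residuals of a candidate coefficient vector (the actual FAST-LTS algorithm is far more elaborate, using concentration steps over many subsets):

```r
# LTS objective: sum of the h smallest squared residuals of a candidate beta
lts_objective <- function(beta, X, y, h) {
  r2 <- sort((y - X %*% beta)^2)
  sum(r2[seq_len(h)])
}

# toy check: on exactly linear data the true coefficients attain objective 0,
# regardless of how many residuals are trimmed away
X <- cbind(1, 1:20)
y <- X %*% c(1, 2)
lts_objective(c(1, 2), X, y, h = 10)  # 0
```

Because only the \(h\) smallest squared residuals enter the sum, up to \(n - h\) observations can be arbitrarily corrupted without affecting the minimizer, which is what gives the LTS its high breakdown point.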
We consider \(40\) cases of a study on production waste and land use, originally published in Golueke, C.G. and McGauhey, P.H. (1970), Comprehensive Studies of Solid Waste Management. The response variable is solid waste (millions of tons), while the remaining explanatory variables are industrial land (acres), fabricated metals (acres), trucking and wholesale trade (acres), retail trade (acres) and restaurants and hotels (acres).
data("waste")
fit_lts <- ltsReg(SolidWaste~., alpha=.75,mcd=TRUE,data=waste)
summary(fit_lts)
##
## Call:
## ltsReg.formula(formula = SolidWaste ~ ., data = waste, alpha = 0.75,
## mcd = TRUE)
##
## Residuals (from reweighted LS):
## Min 1Q Median 3Q Max
## -0.17418 -0.04812 0.00000 0.02386 0.20312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Intercept 1.335e-01 1.862e-02 7.171 6.81e-08 ***
## Land -2.734e-05 1.099e-05 -2.488 0.018850 *
## Metals 1.483e-04 9.418e-05 1.574 0.126290
## Trucking 5.132e-04 6.728e-05 7.627 2.08e-08 ***
## Retail -7.077e-04 3.915e-04 -1.808 0.080995 .
## Restaurants 8.031e-03 2.169e-03 3.702 0.000893 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08231 on 29 degrees of freedom
## Multiple R-Squared: 0.952, Adjusted R-squared: 0.9438
## F-statistic: 115.1 on 5 and 29 DF, p-value: < 2.2e-16
plot(fit_lts)